# load necessary packages
library(tensorflow)
library(keras)
library(data.table)
library(plyr)
library(stringr)
library(textstem)
library(tm)
library(purrr)
# load data
load(file="../Data/data_text.rda")
set.seed(100)
Lecture 7: Ensembling & Negative Sampling
36th International Summer School SAA
University of Lausanne
1 Introduction
This lecture covers Chapters 5 and 8 of Wüthrich et al. (2025).
2 Network ensembling
A critical point of network fitting is that it involves several elements of randomness. Even for a fixed architecture and fitting procedure, one typically has infinitely many equally good fitted models (‘solutions’).
The elements of randomness involve:
the initialization of the network weights \(\vartheta^{[0]}\) for SGD;
the random partition into learning sample \({\cal L}\) and test sample \({\cal T}\);
the random partition into training sample \({\cal U}\) and validation sample \({\cal V}\);
the random partition into the batches \(({\cal U}_k)_{k=1}^{\lfloor n/s \rfloor}\).
There are further random items like drop-outs, etc.
This makes early-stopped SGD solutions (highly) non-unique. This non-uniqueness is typical for machine learning solutions.
2.1 Ensembling/nagging
Breiman (1996) introduced bagging for regression trees.
Bagging combines bootstrapping and aggregating: the bootstrap is a re-sampling technique, and aggregation averages over the re-sampled fits, which reduces the randomness.
We replace the bootstrap by different SGD ‘solutions’: since network fitting involves the above-mentioned elements of randomness, we naturally receive multiple solutions.
Ensembling of network predictors was introduced by Dietterich (2000a) and Dietterich (2000b). Subsequently, it was studied in Richman and Wüthrich (2020), where it was called nagging for network aggregating.
2.2 Ensemble predictor
Having multiple (conditionally) i.i.d. predictors \((\widehat{\mu}_j)_{j=1}^M\), one builds the ensemble predictor \[ \widehat{\mu}^{(M)} = \frac{1}{M} \sum_{j=1}^M \widehat{\mu}_j.\]
This ensemble predictor has an estimation uncertainty \[ \sqrt{{\rm Var} \left(\widehat{\mu}^{(M)}\right)} = \frac{1}{\sqrt{M}}\, \sqrt{{\rm Var}(\widehat{\mu}_1)} ~\to ~0 \qquad \text{ for $M\to \infty$.}\]
The important takeaway is that ensembling over conditionally i.i.d. predictors substantially reduces estimation uncertainty.
Caveat: This does not say anything about a bias of the estimated model for the true model!
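As a small illustration (not part of the lecture code), the following R sketch averages \(M\) i.i.d. noisy predictors of a fixed mean; the noise level and ensemble sizes are made-up values, and the empirical standard deviation of the ensemble shrinks roughly like \(1/\sqrt{M}\).
# illustration only: averaging M i.i.d. noisy predictors reduces the
# standard deviation roughly by the factor 1/sqrt(M); numbers are made up
set.seed(100)
mu.true <- 0.1                      # true (unknown) mean
M.grid <- c(1, 5, 10, 20, 50)       # ensemble sizes considered
sd.ens <- sapply(M.grid, function(M){
  # 1000 simulated ensembles of M noisy predictors each
  sd(replicate(1000, mean(rnorm(M, mean=mu.true, sd=0.05))))
})
round(rbind(M=M.grid, empirical.sd=sd.ens, theoretical.sd=0.05/sqrt(M.grid)), 4)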
2.3 Nagging predictor
Based on the learning sample \({\cal L}=(Y_i,\boldsymbol{X}_i,v_i)_{i=1}^n\), choose \(M\) conditionally i.i.d. fitted FNNs, where ‘conditionally i.i.d.’ refers to the elements of randomness in SGD fitting.
This gives us \(M\) conditionally i.i.d. FNNs \((\mu_{\widehat{\vartheta}_j})_{j=1}^M\), given \({\cal L}\).
This motivates the nagging predictor \[ \widehat{\mu}^{\rm nagg}_M(\boldsymbol{X}) = \frac{1}{M} \sum_{j=1}^M \mu_{\widehat{\vartheta}_j}(\boldsymbol{X}).\]
This ensembling reduces the fluctuations by a factor \(\sqrt{M}\).
The next plot shows how many FNNs we need to ensemble to arrive at an optimal forecast model.
Figure: Out-of-sample Poisson deviance losses as a function of \(M\ge 1\), for the French MTPL claims count example (note that the Poisson deviance loss is scaled differently in this plot).
We conclude that we need roughly 10 to 20 i.i.d. fitted FNNs.
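A minimal sketch of the nagging predictor, assuming a hypothetical helper `fit_FNN(seed)` (not shown here) that fits one FNN on the learning sample with the given SGD seed and returns its predictions on the test sample:
# sketch of the nagging predictor: average M SGD 'solutions';
# fit_FNN(seed) is a hypothetical helper (not defined in this notebook) that
# fits one FNN with the given seed and returns its test sample predictions
M <- 10
pred.matrix <- sapply(1:M, function(j) fit_FNN(seed=j))   # one column per fit
nagging.pred <- rowMeans(pred.matrix)                     # nagging predictor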
2.4 Results: nagging predictor
| model | in-sample loss | out-of-sample loss | balance (in %) |
|---|---|---|---|
| Poisson null model | 47.722 | 47.967 | 7.36 |
| Poisson GLM | 45.585 | 45.435 | 7.36 |
| Poisson FNN | 44.846 | 44.925 | 7.17 |
| nagging predictor | 44.849 | 44.874 | 7.36 |
For the nagging predictor we use \(M=10\) individual network fits.
As expected, we receive an out-of-sample improvement.
3 Tokenization of text and words
When it comes to large unstructured text inputs, a word embedding is fitted in an unsupervised learning manner, using the context in which the words appear. We illustrate this.
For word embedding, we change the notation to \(w\in \{1,\ldots, W\}\) labeling all the words in the available vocabulary by integers.
We start from a sentence \({\tt text}\) of length \(T\) \[ {\tt text}=(w_1,\ldots, w_T) ~\in ~ {\mathbb N}^{T}.\]
The goal is to find a sensible word embedding (WE) \[ \boldsymbol{e}^{\rm WE}:{\mathbb N} \to {\mathbb R}^b, \qquad w \mapsto \boldsymbol{e}^{\rm WE}(w), \] for embedding space \({\mathbb R}^b\); Bengio, Courville and Vincent (2014).
Based on unsupervised learning, one tries to learn embedding vectors from the contexts:
E.g., ‘I’m driving by car to the city’ and ‘I’m driving my vehicle to the town center’ use similar words in a similar context.
Therefore, the embedding vectors of such words should be close in the embedding space \({\mathbb R}^b\), because the words are almost interchangeable.
The goal is to learn such similarity in the meanings from the context.
There are two different approaches:
Predict a center word from its context; a popular method is continuous bag-of-words (CBOW).
Predict the context from a center word; skip-gram is a popular approach.
For simplicity, we only present skip-gram. The other version is quite similar; we refer to Wüthrich et al. (2025).
3.1 Context of words
Consider a sentence \[\begin{equation*} {\tt text} = \left(w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_T\right), \end{equation*}\] where the positional indices \(t\in {\mathbb N}\) become important now.
Aim: Predict the context words \((w_s)_{s\neq t}\) knowing the center word \(w_t\).
Start from a collection of different sentences \[ {\cal C}=\left\{{\tt text}=(w_1,\ldots, w_T)\right\},\] to which positive probabilities are assigned \[ p({\tt text})=p(w_1,\ldots, w_T)>0.\]
These probabilities should reflect the frequencies of the sentences \({\tt text}=(w_1,\ldots, w_T)\) in speech and texts.
Applying Bayes’ rule, one determines how likely a certain context occurs for a given center word \(w_t\) \[ p \left(\left. w_1, \ldots, w_{t-1}, w_{t+1}, \ldots, w_T\right|w_t \right)= \frac{p(w_1,\ldots, w_T)}{p(w_t )}.\]
In general, these probabilities are unknown, and they need to be learned from a learning sample \({\cal L}\).
Learning these probabilities will be based on embedding the words into a low-dimensional embedding space; this is the step where the similarity structure (adjacency) of the word embeddings is learned.
3.2 word2vec: skip-gram approach
A popular approach is the word-to-vector (word2vec) skip-gram approach of Mikolov, Chen, et al. (2013) and Mikolov, Sutskever, et al. (2013).
Since this problem is too complex, one solves a simpler problem:
- First, one restricts to a fixed small context (window) size \(c \in {\mathbb N}\) \[ p\left(\left. w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c} \right|w_{t} \right).\]
- Second, one assumes conditional independence of the context words, given the center word \(w_t\).
Remark. Real texts do not satisfy this simplification, but this setup is still sufficient to obtain reasonable word embeddings.
Under the conditional independence assumption, we obtain the log-likelihood on the learning sample \({\cal L}\), for a given context size \(c \in {\mathbb N}\), \[ \ell_{\cal L} ~=~ \sum_{i=1}^n \sum_t \sum_{-c \le j \le c, \,j\neq 0}\log p \left(\left. w_{i,t+j} \right|w_{i,t} \right).\]
Maximize this log-likelihood \(\ell_{\cal L}\) in the conditional probabilities \(p(\cdot|\cdot)\) to learn the most common context words of a given center word \(w_t\).
The embeddings \(\boldsymbol{e}^{\rm WE}(w)\in {\mathbb R}^b\) of all words \(w\) are learned by letting them enter the conditional probabilities \(p(\cdot|\cdot)\) and maximizing the resulting log-likelihood.
There is one special point: one needs two different word embeddings \(\boldsymbol{e}^{(1)}(w)\in {\mathbb R}^b\) and \(\boldsymbol{e}^{\rm (2)}(w)\in {\mathbb R}^b\) for center and context words, as these two play different roles in the conditional probabilities.
Assume the conditional probabilities are modeled by the softmax function \[ p\left.\left(w_{s}\right|w_{t}\right) = \frac{\exp \left\langle \boldsymbol{e}^{\rm (1)}(w_t), \boldsymbol{e}^{\rm (2)}(w_{s})\right\rangle}{\sum_{w=1}^{W}\exp \left\langle \boldsymbol{e}^{\rm (1)}(w_t), \boldsymbol{e}^{\rm (2)}(w)\right\rangle}~\in ~(0,1).\]
If the scalar/dot product between \(\boldsymbol{e}^{\rm (1)}(w_t)\) and \(\boldsymbol{e}^{\rm (2)}(w_s)\) is large, there is a high probability that \(w_s\) is in the context of the center word \(w_t\).
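To make the softmax concrete, here is a small toy computation (not from the lecture), with a vocabulary of \(W=5\) words, embedding dimension \(b=2\), and random placeholder embedding matrices:
# toy skip-gram softmax: probability that word w_s lies in the context of the
# center word w_t; E1 and E2 are random placeholder embedding matrices
set.seed(100)
W <- 5; b <- 2
E1 <- matrix(rnorm(W*b), nrow=W)        # center word embeddings e^(1)(w)
E2 <- matrix(rnorm(W*b), nrow=W)        # context word embeddings e^(2)(w)
w_t <- 3                                # selected center word
scores <- as.vector(E2 %*% E1[w_t,])    # <e^(1)(w_t), e^(2)(w)> for all w
round(exp(scores)/sum(exp(scores)), 3)  # softmax probabilities p(w_s | w_t)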
Collecting everything, one receives the log-likelihood function \[ \ell_{\cal L} ~=~ \sum_{i=1}^n \sum_t \sum_{-c \le j \le c, \,j\neq 0}\log p \left(\left. w_{i,t+j} \right|w_{i,t} \right),\] where the conditional probabilities are now given by the softmax above and, hence, depend on the two word embeddings.
Maximizing this log-likelihood \(\ell_{\cal L}\) for the given learning sample \({\cal L}\) gives us the two (different) word embeddings.
Optimization is done by variants of SGD.
3.3 Negative sampling
The above word2vec skip-gram approach is computationally expensive.
Negative sampling turns the above unsupervised learning problem into a supervised learning problem of lower complexity; see Mikolov, Sutskever, et al. (2013).
For this, we consider pairs \((w,\widetilde{w})\) of center words \(w\) and context words \(\widetilde{w}\) from the vocabulary \(\{1,\ldots, W\}\). To each of these pairs we add a binary response variable \(Y \in \{0, 1\}\), resulting in observations \((Y,w,\widetilde{w})\).
There will be two types of center-context pairs:
real ones, which come from the learning sample \({\cal L}\) and receive \(Y=1\), and
fake ones, which are generated purely at random and receive \(Y=0\).
Construct these two types of pairs as follows:
Extract all center-context pairs \((w,\widetilde{w})\) from the learning sample \({\cal L}\) and assign the response \(Y=1\) to these pairs, indicating that these are true pairs. This gives the first part of the learning data, denoted by \[{\cal L}_1=(Y_i=1, w_i,\widetilde{w}_i)_{i=1}^n.\]
Take all real pairs \((w_i,\widetilde{w}_i)_{i=1}^n\) and randomly permute the indices of the context words according to a permutation \(\pi\). This gives a second (fake) learning data set \[{\cal L}_2=(Y_{n+i}=0, w_{n+i},\widetilde{w}_{n+\pi(i)})_{i=1}^n,\] with response \(Y=0\).
Merging the real and fake learning data gives us a learning sample \({\cal L}={\cal L}_1\cup {\cal L}_2\) of sample size \(2n\).
This turns into the supervised logistic regression problem \[\begin{eqnarray*} \ell_{\cal L} &=& \sum_{i=1}^{2n}\log {\mathbb P} \left[\left. Y=Y_i \right| w_i,\widetilde{w}_i \right]\\&=&\sum_{i=1}^{n} \log \left(\frac{1}{1+ \exp \langle -\boldsymbol{e}^{\rm (1)}(w_i), \boldsymbol{e}^{\rm (2)} (\widetilde{w}_i) \rangle}\right)\\&&+~ \sum_{k=n+1}^{2n} \log \left(\frac{1}{1+ \exp \langle\boldsymbol{e}^{\rm (1)}(w_{k}), \boldsymbol{e}^{\rm (2)} (\widetilde{w}_{k}) \rangle }\right). \end{eqnarray*}\]
Maximizing this log-likelihood \(\ell_{\cal L}\), one can learn the two embeddings \(\boldsymbol{e}^{(1)}\) and \(\boldsymbol{e}^{\rm (2)}\).
For SGD training to work properly with negative sampling, one should randomly permute the instances in \({\cal L}={\cal L}_1\cup {\cal L}_2\) to ensure that all (mini-)batches contain instances of both types.
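As a small worked example (with placeholder embedding vectors), the two sigmoid terms in the above log-likelihood can be evaluated as follows; the numbers are purely illustrative:
# illustration of the two terms in the negative sampling log-likelihood;
# e1 and e2 are placeholder embeddings of a center and a context word
sigmoid <- function(x){1/(1+exp(-x))}
e1 <- c(0.4, -0.2)                  # e^(1)(w), center word embedding
e2 <- c(0.3, 0.1)                   # e^(2)(w~), context word embedding
score <- sum(e1*e2)                 # scalar product <e^(1)(w), e^(2)(w~)>
log(sigmoid(score))                 # log-likelihood term of a real pair (Y=1)
log(sigmoid(-score))                # log-likelihood term of a fake pair (Y=0)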
3.4 Example: word2vec skip-gram with negative sampling
- We give an example based on the claims texts of Frees (2020).
# remove stopwords, to lower case and remove white space at start and end
dat2 <- data_text %>% mutate(clean = Description %>% removeWords(stopwords("en")) %>% str_to_lower() %>% str_squish())
# remove numbers
dat2$clean <- str_squish(removeNumbers(dat2$clean))
# remove punctuation
dat2$clean <- str_squish(removePunctuation(dat2$clean, preserve_intra_word_contractions = FALSE, preserve_intra_word_dashes = TRUE))
# remove damaged: because it occurs too frequently
dat2$clean <- str_squish(removeWords(dat2$clean, words=c("damage", "damaged", "damge", "dmage", "dmgd", "damged", "dmaged", "damgae", "damgae", "damaging")))
# lemmatize
dat2$clean <- lemmatize_strings(dat2$clean, dictionary = lexicon::hash_lemmas)
# tokenize cleaned data
tokenizer1 <- text_tokenizer() %>% fit_text_tokenizer(dat2$clean)
# count the number of used words
text.matrix <- texts_to_matrix(tokenizer1, dat2$clean, mode = "count")
length(colSums(text.matrix)[-1])
[1] 1819
words.used <- colSums(text.matrix)[-1]
# minimal occurrence for embedding
wwww <- 20
(words <- length(words.used[words.used>=wwww]))
[1] 126
Since this is a very small dataset, we only embed the most frequent words, i.e., those that appear at least 20 times over all texts.
# we only embed the words that occur at least 20 times
tokenizer2 <- text_tokenizer(num_words=words+1) %>% fit_text_tokenizer(dat2$clean)
# this gives smaller texts
text.matrix <- texts_to_matrix(tokenizer2, dat2$clean, mode = "count")
# words used
(max.words <- length(colSums(text.matrix)[-1]))
[1] 126
# maximal sentence lengths
(maxlen <- max(rowSums(text.matrix)))
[1] 7
# tokenized sentences
seqs <- texts_to_sequences(tokenizer2, dat2$clean)
# true center-context pairs (we select a window size of 2)
jj0 <- 0
for (jj in 1:nrow(dat2)){
  # only consider texts with more than one word
  if (length(unlist(seqs[[jj]]))>1){
    jj0 <- jj0 + 1
    # generate the negative samples ourselves to control the seed
    tt <- skipgrams(sequence=unlist(seqs[[jj]]),
                    vocabulary_size=words, window_size=2, negative_samples=0)
    xx <- matrix(unlist(tt$couples), ncol=2, byrow=TRUE)
    yy <- tt$labels
    gram0 <- data.frame(cbind(xx,yy))
    names(gram0) <- c("w1", "w2", "yy")
    if (jj0==1){
      gram <- gram0
    } else {
      gram <- rbind(gram, gram0)
    }
  }
}
skipgram <- gram
# generate fake center-context pairs by permuting the context word w2
gram$yy <- 0
set.seed(100)
gram$w2 <- gram[sample(1:nrow(gram)),"w2"]
# merge the two samples and randomize the order
skipgram <- rbind(skipgram, gram)
skipgram <- skipgram[sample(1:nrow(skipgram)),]
skipgram[1:5,]
      w1 w2 yy
3779  52 91  1
2931  96 45  1
22557 12 36  0
8282  34 77  1
78     3 52  1
24227  1 93  0
network.word2vec <- function(seed, W1, b1){
  tf$keras$backend$clear_session()
  set.seed(seed)
  set_random_seed(seed)
  center  <- layer_input(shape = c(1), dtype = 'int32')
  context <- layer_input(shape = c(1), dtype = 'int32')
  centerEmb <- center %>%
    layer_embedding(input_dim = W1, output_dim = b1, input_length = 1,
                    name = 'centerEmb') %>% layer_flatten()
  contextEmb <- context %>%
    layer_embedding(input_dim = W1, output_dim = b1, input_length = 1,
                    name = 'contextEmb') %>% layer_flatten()
  response <- list(centerEmb, contextEmb) %>%
    layer_dot(axes = 1) %>%
    layer_dense(units=1, activation='sigmoid')
  keras_model(inputs = c(center, context), outputs = c(response))
}
# center-context pairs input data
center  <- as.matrix(skipgram$w1-1)
context <- as.matrix(skipgram$w2-1)
# embedding dimension
b1 <- 2
model <- network.word2vec(seed=100, words, b1)
model %>% compile(loss = "binary_crossentropy", optimizer = "nadam")
Model: "model"
________________________________________________________________________________
 Layer (type)            Output Shape         Param #   Connected to
================================================================================
 input_1 (InputLayer)    [(None, 1)]          0         []
 input_2 (InputLayer)    [(None, 1)]          0         []
 centerEmb (Embedding)   (None, 1, 2)         252       ['input_1[0][0]']
 contextEmb (Embedding)  (None, 1, 2)         252       ['input_2[0][0]']
 flatten (Flatten)       (None, 2)            0         ['centerEmb[0][0]']
 flatten_1 (Flatten)     (None, 2)            0         ['contextEmb[0][0]']
 dot (Dot)               (None, 1)            0         ['flatten[0][0]',
                                                         'flatten_1[0][0]']
 dense (Dense)           (None, 1)            2         ['dot[0][0]']
================================================================================
Total params: 506 (1.98 KB)
Trainable params: 506 (1.98 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
fit <- model %>% fit(list(center, context), skipgram$yy,
                     validation_split=0.2, batch_size=5000, epochs=epochs, verbose=0)
# function to extract the weights by layer name
get_embedding_values <- function(layer_name){
  embedding <- model %>% get_layer(layer_name) %>% get_weights()
  temp <- embedding[[1]] %>% data.table()
  temp %>% setnames(names(temp), paste0("dim", seq(1:length(names(temp)))))
  temp
}
# indices of hazards insured
hazard <- c(str_to_lower(sort(unique(dat2$Hazard))), "water")
index <- unlist(tokenizer2$word_index)[unlist(tokenizer2$index_word) %in% hazard]
index
# extract embedding of center words
variable <- "center"
embed_dims <- get_embedding_values(paste(variable, "Emb", sep=""))
Figure: Center word embeddings of the 50 most frequent words (embedding dimension \(b=2\)); red color shows the insured hazards. Naturally, one should select a higher embedding dimension, but \(b=2\) can be illustrated nicely.
3.5 Conclusions word2vec
Naturally, one should select a higher embedding dimension than \(b=2\); principal component analysis (PCA) can then be used to illustrate such higher-dimensional embeddings in two dimensions (see the sketch below).
Pre-trained embeddings can be downloaded; e.g., GloVe embeddings are available for embedding dimensions 50, 100, 200 and 300, trained on a large internet corpus; see Pennington, Socher and Manning (2014).
The difficulty with pre-trained embeddings is that they may be pre-trained in a different context, e.g., ‘policy’ may have different meanings in insurance and machine learning.
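A minimal sketch of such a PCA illustration, assuming the embedding table `embed_dims` extracted above, now fitted with more than two columns (e.g., \(b=10\)); `prcomp` is base R:
# sketch: project higher-dimensional word embeddings onto the first two
# principal components for plotting; assumes embed_dims has b > 2 columns
pca <- prcomp(as.matrix(embed_dims), center=TRUE, scale.=FALSE)
embed_2d <- pca$x[, 1:2]            # scores on the first two principal components
plot(embed_2d, xlab="PC1", ylab="PC2",
     main="word embeddings projected by PCA")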
Copyright
© The Authors
This notebook and these slides are part of the project “AI Tools for Actuaries”. The lecture notes can be downloaded from:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162304
\(\,\)
- This material is provided to reusers to distribute, remix, adapt, and build upon in any medium or format for noncommercial purposes only, and only so long as attribution and credit are given to the original authors and source, and you indicate whether changes were made. This aligns with the Creative Commons Attribution 4.0 International License CC BY-NC.